Salary Prediction¶

1 - Frame the Problem¶

When new employees join an organisation, it can be difficult to categorise what pay scale each employee should fall under, and deciding this often requires a lot of time-consuming negotiation. The organisation therefore requires a way to decide what salary boundary its new hires should fall under without the need for lengthy negotiation.

To address this problem, a machine learning classification model is used. The target feature, salary, categorises each employee's salary into one of two boundaries: above 50K or at/below 50K. The classification model is trained on an employee salary dataset that holds employment-related information which would affect salaries.

  • For each employee, the dataset records which salary group they fall under. This dataset is used to train our model to correctly place employees into their respective salary boundaries.
  • Once trained, the model is given new employee data, described by the same features, which it has not seen before.
  • The aim is for the model to accurately predict which salary boundary each new employee will fall under, which overcomes the problem the organisation is facing.

The organisation can then use the classification model by providing it with the relevant employee data, based on the features in the dataset, to determine which salary boundary each new employee should fall under.

In [1]:
# Importing libraries

# Data processing, 
import numpy as np
import pandas as pd

# Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns             
%matplotlib inline

# Imputation
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings('ignore')

# External Graphing Imports
import plotly.express as px

# Modelling Imports
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score
from sklearn.preprocessing import LabelEncoder

#Individual Module Files
from modules.PlotlyGraphsModule_KS_29007147 import PlotlyGraphs_KS_29007147
from modules.CheckingDuplicatesModule_HS_29012930 import CheckingDuplicateValues_HS_29012930
from modules.scale_the_data_RG_29014027 import scale_the_data
from modules.RemoveCorrelatedInputs_AS_29020256 import remove_correlated_inputs
from modules.Normalisation_AS_29020256 import normalisation
from modules.SplitGraphs_RR_29003671 import perform_eda
from modules.Graphs_RR_29003671 import Graphs

Installation Method¶

pip install plotly

2 - Collect Data¶

Extraction was done by Barry Becker from the 1994 Census database.
https://www.kaggle.com/datasets/ayessa/salary-prediction-classification

In [2]:
# Loading file

df = pd.read_csv("./salary.csv")

df.head()
Out[2]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

3 - Exploratory Data Analysis¶

General Statistics¶

Information about the dataset.

In [3]:
# Obtaining information about the dataset rows and columns, to check the number of records and features.
total_rows = df.shape[0]
total_columns = df.shape[1]
missing_values = sum(df.isna().sum())
duplicate_values = df.duplicated().sum()
data_types = df.dtypes

print(f"Total number of instances(rows) in dataset: {total_rows}")
print(f"Total number of features(columns) in dataset: {total_columns}")
print(f"Missing Values: {missing_values}")
print(f"Duplicate Values: {duplicate_values}\n")

data_types
Total number of instances(rows) in dataset: 32561
Total number of features(columns) in dataset: 15
Missing Values: 0
Duplicate Values: 24

Out[3]:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object

Checking labels for each categorical feature¶

We check the unique values that exist in each categorical feature of our dataset. The reasoning is that datasets sometimes contain unusual values, such as the question-mark entries seen here, which effectively count as missing data. This check shows whether our data contains anything unusual that we may need to clean during pre-processing.

In [4]:
# Identifying categorical features and their associated labels

for col in df.columns:
    if df[col].dtype=='object':
        print()
        print(col)
        print(df[col].unique())
workclass
[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']

education
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']

marital-status
[' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']

occupation
[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']

relationship
[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']

race
[' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']

sex
[' Male' ' Female']

native-country
[' United-States' ' Cuba' ' Jamaica' ' India' ' ?' ' Mexico' ' South'
 ' Puerto-Rico' ' Honduras' ' England' ' Canada' ' Germany' ' Iran'
 ' Philippines' ' Italy' ' Poland' ' Columbia' ' Cambodia' ' Thailand'
 ' Ecuador' ' Laos' ' Taiwan' ' Haiti' ' Portugal' ' Dominican-Republic'
 ' El-Salvador' ' France' ' Guatemala' ' China' ' Japan' ' Yugoslavia'
 ' Peru' ' Outlying-US(Guam-USVI-etc)' ' Scotland' ' Trinadad&Tobago'
 ' Greece' ' Nicaragua' ' Vietnam' ' Hong' ' Ireland' ' Hungary'
 ' Holand-Netherlands']

salary
[' <=50K' ' >50K']

Counts for each Feature¶

The number of unique values in each feature is checked to ensure that every feature takes more than one possible value. If a feature had only one possible value across all rows, it would carry no information for the model and would have no effect on performance, so it would be dropped to save modelling power and resources. In our problem this was not the case, as seen above; a minimal sketch of such a check is shown below.
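As a minimal sketch of this check (the variable names are illustrative and not part of the notebook's modules), constant-valued columns could be identified and dropped as follows:

# Sketch: identify and drop any feature with a single unique value (none exist in this dataset)
constant_columns = [col for col in df.columns if df[col].nunique() <= 1]
df_reduced = df.drop(columns=constant_columns)
print(f"Dropped constant columns: {constant_columns}")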

In [5]:
def summarising(df):  # Function that takes the dataframe as input for the salary classification problem
    summarising_df = pd.DataFrame({  # Create a new dataframe to store the required info
        'uniques': df.nunique()  # Number of unique values per column
    })
    return summarising_df

summarising(df)
Out[5]:
uniques
age 73
workclass 9
fnlwgt 21648
education 16
education-num 16
marital-status 7
occupation 15
relationship 6
race 5
sex 2
capital-gain 119
capital-loss 92
hours-per-week 94
native-country 42
salary 2

Checking Values¶

Another pre-processing step is to check whether there are missing or duplicate values in our dataset. Even if there turn out to be none, this is still an important check to carry out before modelling.

In [6]:
CheckingDuplicateValues_HS_29012930(df)
Missing Values: 0
Duplicate Values: 24
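The implementation of the helper module is not shown in the notebook; assuming it simply counts NaN values and duplicated rows, a minimal equivalent sketch would be:

# Sketch of the assumed behaviour of the missing/duplicate-value helper
def check_missing_and_duplicates(data):
    print(f"Missing Values: {data.isna().sum().sum()}")
    print(f"Duplicate Values: {data.duplicated().sum()}")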

Graphs (Plotly - External Package)¶

The external library used is Plotly, for creating graphs and visualisations. It provides tools for creating scatter plots, line plots, bar charts, pie charts and various other visualisations as required. In this case, Plotly was used to analyse how the salaries were distributed between genders.

  • Gender Distribution: the dataset is skewed towards males, with 21,790 male individuals compared to 10,771 females. This gender imbalance may impact the accuracy of any salary-related predictions.
  • Salary Distribution: the majority of individuals (24,720) have salaries at or below 50,000, while only 7,841 individuals earn more than 50,000. The dataset is imbalanced in favour of lower salaries.
  • Gender-Salary Relationship: the two bar charts showing the salary distribution for males and females both exhibit male dominance; however, this could be influenced by the overall gender distribution in the dataset.

In summary, the dataset contains more male data points, and most individuals earn salaries below 50,000. Keep these factors in mind when interpreting any predictions or conclusions based on this data.

In [7]:
PlotlyGraphs_KS_29007147(df)
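The module's internals are not shown here; a minimal sketch of the kind of chart described above, built directly with plotly.express (variable names are illustrative), might look like:

# Sketch: grouped bar chart of salary class counts per gender using plotly.express
gender_salary_counts = df.groupby(['sex', 'salary']).size().reset_index(name='count')
fig = px.bar(gender_salary_counts, x='sex', y='count', color='salary',
             barmode='group', title='Salary distribution by gender')
fig.show()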

Correlation Matrix¶

A correlation matrix is a table that shows the strength and direction of the linear relationship between pairs of variables in a dataset; here the absolute values are plotted, so only the strength is visible.

  • Age and Marital Status: the strongest correlation involving age is with marital-status (around 0.48), reflecting that younger individuals are more likely to be never-married.
  • Age and Workclass: the correlation between age and workclass is weak (around 0.06), suggesting there is no strong linear relationship between age and work type in this dataset.
  • Age and fnlwgt: fnlwgt is the census 'final weight', an estimate of the number of people each record represents. Its correlation with age is very weak (around 0.08).
  • Education and education-num: these two columns describe the same concept (educational level as a label and as a numeric rank), so a positive correlation between their encoded values is expected, although the alphabetical label encoding weakens the apparent strength.
In [28]:
df_copy = df.copy()

labelencoder=LabelEncoder()
for column in df.columns:
    df_copy[column] = labelencoder.fit_transform(df_copy[column])
df_copy.head()

correlation_matrix = df_copy.corr().abs()
correlation_matrix

plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

4 - Data Preprocessing¶

Handling Duplicated Rows¶

Duplicate rows can have a negative influence on the dataset, so the code below drops them. The total number of rows removed is 24, which is less than 1% of the dataset, so the number of duplicated rows was comparatively small. Dropping them left us with 32,537 rows to train the model, which is still sufficient for making accurate predictions.

In [29]:
## Let's get rid of duplicate entries

df.drop_duplicates(keep='first',inplace=True)

rows = len(df.index)
columns = len(df.columns)
duplicate_values = df.duplicated().sum()

print(f"Rows: {rows}")
print(f"Columns: {columns}")
print(f"Duplicate Values: {duplicate_values}")
Rows: 32537
Columns: 15
Duplicate Values: 0


Checking for '?' in the Dataset¶

  • From analysing the data, several labels contain '?', namely in workclass, occupation, and native-country.
  • There are also leading spaces in front of labels throughout the dataset.
In [30]:
# Removing extra spaces
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

# Replace '?' symbols with NaN
df.replace('?', np.nan, inplace=True)

# Counting Nan values
print("Number of Nan Values:")
pd.isna(df).sum()[pd.isna(df).sum() > 0]
Number of Nan Values:
Out[30]:
workclass         1836
occupation        1843
native-country     582
dtype: int64

The counts show that workclass and occupation are related: when one is missing, the other is usually missing as well.

Mode Imputation¶

Mode imputation was employed to address missing values in the categorical features 'workclass', 'occupation' and 'native-country'. This technique replaces missing entries with the most frequent value within each feature. This improves the dataset for modelling by ensuring all features have complete data, allowing machine learning algorithms to learn more effectively from the patterns within the dataset.

In [31]:
# Mode Imputation to deal with NaN values in categorical features

# Create a SimpleImputer object using the 'most_frequent' strategy
imputer = SimpleImputer(strategy='most_frequent')

# Define the features to be imputed (categorical in this case)
features_to_impute = ['workclass', 'occupation', 'native-country']

# Select the categorical features containing missing values
X_missing = df[features_to_impute]

# Fit the imputer to learn the most frequent value in each feature
imputer.fit(X_missing)

# Replace the missing values with the learned most frequent values
X_imputed = imputer.transform(X_missing)

# Create a copy of the original DataFrame to avoid modifying the original data
df_updated = df.copy()

# Write the imputed values back into the copied DataFrame
df_updated[features_to_impute] = X_imputed

# Print information about the updated DataFrame
rows = len(df_updated.index)
columns = len(df_updated.columns)
names_types = df_updated.dtypes
missing_values = sum(df_updated.isna().sum())
duplicate_values = df_updated.duplicated().sum()

print(f"Rows: {rows}")
print(f"Columns: {columns}")
print(f"Missing Values: {missing_values}")
print(f"Duplicate Values: {duplicate_values}\n")
print(names_types)

# Display the first 5 rows of the updated DataFrame
df_updated.head(5)
Rows: 32537
Columns: 15
Missing Values: 0
Duplicate Values: 0

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object
Out[31]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

Renaming Labels¶

The code block renames the category labels within several features to shorter, more understandable names, which reduces complexity.

In [32]:
 # Renaming Labels
df_updated.replace({'workclass': {'State-gov': 'Govt.', 'Self-emp-not-inc': 'self_emp', 'Federal-gov': 'Govt.', 'Local-gov': 'Govt.', 'Self-emp-inc':'self_emp', 'Without-pay': 'Broke', 'Never-worked': 'Broke'}}, inplace=True)

df_updated.replace({'marital-status': {'Married-civ-spouse': 'Married', 'Divorced': 'DASW', 'Married-spouse-absent': 'DASW', 'Separated': 'DASW', 'Married-AF-spouse':'Married', 'Widowed': 'DASW'}}, inplace=True)

df_updated.replace({'occupation': {'Adm-clerical': 'Adminstration', 'Exec-managerial': 'Executive', 'Handlers-cleaners': 'Handlers', 'Prof-specialty': 'Professionals', 'Other-service' : 'Other', 'Craft-repair' : 'Repairing', 'Farming-fishing' : 'Farming', 'Transport-moving':'Transportation', 'Machine-op-inspct': 'MachineOp', 'Protective-serv' : 'ProtectiveServ', 'Priv-house-serv': 'HouseServ'}}, inplace=True)

df_updated.replace({'native-country': {'United-States': 'USA', 'South': 'SouthKorea', 'Puerto-Rico': 'PuertoRico', 'Dominican-Republic': 'DominicRep', 'Outlying-US(Guam-USVI-etc)':'OutlyingUSA', 'Trinadad&Tobago': 'Tri&Tob', 'Holand-Netherlands': 'Netherlands', 'Hong' : 'HongKong'}}, inplace=True)

df_updated.replace({'race': {'Asian-Pac-Islander': 'APAC', 'Amer-Indian-Eskimo': 'NatAm'}}, inplace=True)

# Checking Labels for each feature
for col in df_updated.columns:
    if df_updated[col].dtype=='object':
        print()
        print(col)
        print(df_updated[col].unique())
workclass
['Govt.' 'self_emp' 'Private' 'Broke']

education
['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']

marital-status
['Never-married' 'Married' 'DASW']

occupation
['Adminstration' 'Executive' 'Handlers' 'Professionals' 'Other' 'Sales'
 'Repairing' 'Transportation' 'Farming' 'MachineOp' 'Tech-support'
 'ProtectiveServ' 'Armed-Forces' 'HouseServ']

relationship
['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']

race
['White' 'Black' 'APAC' 'NatAm' 'Other']

sex
['Male' 'Female']

native-country
['USA' 'Cuba' 'Jamaica' 'India' 'Mexico' 'SouthKorea' 'PuertoRico'
 'Honduras' 'England' 'Canada' 'Germany' 'Iran' 'Philippines' 'Italy'
 'Poland' 'Columbia' 'Cambodia' 'Thailand' 'Ecuador' 'Laos' 'Taiwan'
 'Haiti' 'Portugal' 'DominicRep' 'El-Salvador' 'France' 'Guatemala'
 'China' 'Japan' 'Yugoslavia' 'Peru' 'OutlyingUSA' 'Scotland' 'Tri&Tob'
 'Greece' 'Nicaragua' 'Vietnam' 'HongKong' 'Ireland' 'Hungary'
 'Netherlands']

salary
['<=50K' '>50K']

Counts for Each Country¶

In [33]:
# Checking counts for each Country

df_updated['native-country'].value_counts()
Out[33]:
USA            29735
Mexico           639
Philippines      198
Germany          137
Canada           121
PuertoRico       114
El-Salvador      106
India            100
Cuba              95
England           90
Jamaica           81
SouthKorea        80
China             75
Italy             73
DominicRep        70
Vietnam           67
Japan             62
Guatemala         62
Poland            60
Columbia          59
Taiwan            51
Haiti             44
Iran              43
Portugal          37
Nicaragua         34
Peru              31
France            29
Greece            29
Ecuador           28
Ireland           24
HongKong          20
Cambodia          19
Tri&Tob           19
Laos              18
Thailand          18
Yugoslavia        16
OutlyingUSA       14
Honduras          13
Hungary           13
Scotland          12
Netherlands        1
Name: native-country, dtype: int64

The USA is the country with the highest number of records, accounting for around 91% of the dataset.

Splitting dataset into USA & Non-USA¶

We chose to split the dataset for EDA because the USA accounts for the majority of records; splitting allows the distribution of records from the other countries to be analysed properly.

In [34]:
 # Splitting dataset into USA and Non-USA

USA = df_updated[df_updated['native-country'] == 'USA']
NonUSA = df_updated[df_updated['native-country'] != 'USA']

print('USA', USA.shape)
print('Non-USA', NonUSA.shape)
USA (29735, 15)
Non-USA (2802, 15)
In [38]:
# Perform EDA on the full dataset
perform_eda(df_updated, "Total")
    
# Perform EDA on USA data
perform_eda(USA, "USA")

# Perform EDA on Non-USA data
perform_eda(NonUSA, "Non-USA")
  • The dataset is imbalanced: over 75% of records are <=50K salary, whereas >50K is only around 24%.
  • The USA's share of >50K salaried individuals is around 24%, which is similar to the overall percentage.
  • The Non-USA <=50K segment, at 81.3%, is higher than the overall <=50K segment.
In [55]:
# Numerical Analysis (Age & Hours/Week)

plt.subplots(figsize=(15,10))

plt.subplot(2,2,1)
plt.title('Age of the Individual : Histogram',fontsize=16)
sns.distplot(df_updated.age, bins=73)
plt.ylabel(None), plt.yticks([]), plt.xlabel(None)

plt.subplot(2,2,2)
plt.title('Hours / Week: Histogram', fontsize=16)
sns.distplot(df_updated['hours-per-week'], color='#40E0D0', bins=98)
plt.ylabel(None), plt.yticks([]), plt.xlabel(None)

plt.subplot(2,2,3)
plt.title('Age of the Individual : Box & Whisker Plot', fontsize=16)
sns.boxplot(df_updated['age'], orient='h',color="#c7e9b4")

plt.subplot(2,2,4)
plt.title('Hours / Week: Box & Whisker Plot', fontsize=16)
sns.boxplot(df_updated['hours-per-week'], orient='h', color="#c7e9b4")

plt.show()

Using box & whisker plots and histograms to understand distribution and identify outliers.

  • Age: The age distribution exhibits a tail towards older individuals, with some entries exceeding 70 years old. This might warrant further investigation to assess data quality or potential inconsistencies.
  • Hours per Week: The data shows a presence of outliers in the working hours, with individuals reported to be working 70 hours or more per week. While this could be plausible for certain professions or self-employed individuals, it's crucial to scrutinize these extreme values to ensure their accuracy and potential impact on analysis.
In [41]:
Graphs(df_updated)

Categorical Data to Numerical¶

We employed LabelEncoder to convert categorical values within our dataset into numerical representations. This technique assigns a unique integer to each category within a feature, enabling algorithms that require numerical input to process the data effectively. By applying LabelEncoder to all columns in the DataFrame, we've transformed the entire dataset into numerical form for further analysis and modeling.

In [42]:
# Using LabelEncoder to convert category values to ordinal

labelencoder=LabelEncoder()
for column in df_updated.columns:
    df_updated[column] = labelencoder.fit_transform(df_updated[column])
df_updated.head()
Out[42]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 22 1 2671 9 12 2 0 1 4 1 25 0 39 38 0
1 33 3 2926 9 12 1 2 0 4 1 0 0 12 38 0
2 21 2 14086 11 8 0 4 1 4 1 0 0 39 38 0
3 36 2 15336 1 6 1 4 0 1 1 0 0 39 38 0
4 11 2 19355 9 12 1 8 5 1 0 0 0 39 4 0

Correlated Input¶

In [43]:
correlation_matrix = df_updated.corr().abs()
correlation_matrix
Out[43]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
age 1.000000 0.058500 0.078184 0.010539 0.036244 0.475668 0.002769 0.263828 0.027600 0.088740 0.125907 0.065035 0.068878 0.000507 0.234131
workclass 0.058500 1.000000 0.021546 0.025152 0.070432 0.009118 0.061881 0.087260 0.077088 0.114771 0.037712 0.019732 0.096918 0.017298 0.026338
fnlwgt 0.078184 0.021546 1.000000 0.026957 0.042957 0.033027 0.008916 0.006958 0.039421 0.025985 0.004586 0.009901 0.019313 0.069017 0.010605
education 0.010539 0.025152 0.026957 1.000000 0.359085 0.002876 0.059741 0.011057 0.013969 0.027433 0.031448 0.016157 0.056784 0.077512 0.079366
education-num 0.036244 0.070432 0.042957 0.359085 1.000000 0.017983 0.058692 0.094432 0.029606 0.012205 0.154414 0.084144 0.150402 0.091344 0.335272
marital-status 0.475668 0.009118 0.033027 0.002876 0.017983 1.000000 0.014520 0.042539 0.017139 0.074297 0.047855 0.026197 0.110990 0.009382 0.106078
occupation 0.002769 0.061881 0.008916 0.059741 0.058692 0.014520 1.000000 0.134775 0.027019 0.195508 0.009115 0.002675 0.018043 0.004515 0.004927
relationship 0.263828 0.087260 0.006958 0.011057 0.094432 0.042539 0.134775 1.000000 0.123128 0.582594 0.093197 0.064319 0.251253 0.010479 0.250948
race 0.027600 0.077088 0.039421 0.013969 0.029606 0.017139 0.027019 0.123128 1.000000 0.096192 0.027443 0.017728 0.046902 0.119721 0.072093
sex 0.088740 0.114771 0.025985 0.027433 0.012205 0.074297 0.195508 0.582594 0.096192 1.000000 0.077602 0.049550 0.231232 0.000828 0.215969
capital-gain 0.125907 0.037712 0.004586 0.031448 0.154414 0.047855 0.009115 0.093197 0.027443 0.077602 1.000000 0.057015 0.101346 0.013882 0.340019
capital-loss 0.065035 0.019732 0.009901 0.016157 0.084144 0.026197 0.002675 0.064319 0.017728 0.049550 0.057015 1.000000 0.058805 0.010404 0.162494
hours-per-week 0.068878 0.096918 0.019313 0.056784 0.150402 0.110990 0.018043 0.251253 0.046902 0.231232 0.101346 0.058805 1.000000 0.006495 0.232365
native-country 0.000507 0.017298 0.069017 0.077512 0.091344 0.009382 0.004515 0.010479 0.119721 0.000828 0.013882 0.010404 0.006495 1.000000 0.023652
salary 0.234131 0.026338 0.010605 0.079366 0.335272 0.106078 0.004927 0.250948 0.072093 0.215969 0.340019 0.162494 0.232365 0.023652 1.000000
In [44]:
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
  • There is a relatively high correlation between sex and relationship. This can be explored when building our logistic regression model by potentially removing one of these columns and seeing how the model performs.
  • As part of the in-depth analysis of the model we will remove one of these columns at a time and explore the performance of the logistic regression model.

Removing Correlated Inputs¶

In [24]:
clean_data = remove_correlated_inputs(df_updated.copy())
clean_data.head(10)
Out[24]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 22 1 2671 9 12 2 0 1 4 1 25 0 39 38 0
1 33 3 2926 9 12 1 2 0 4 1 0 0 12 38 0
2 21 2 14086 11 8 0 4 1 4 1 0 0 39 38 0
3 36 2 15336 1 6 1 4 0 1 1 0 0 39 38 0
4 11 2 19355 9 12 1 8 5 1 0 0 0 39 4 0
5 20 2 17700 12 13 1 2 5 4 0 0 0 39 38 0
6 32 2 8536 6 4 0 7 1 1 0 0 0 15 21 0
7 35 3 13620 11 8 1 2 0 4 1 0 0 44 38 1
8 14 2 1318 12 13 2 8 1 4 0 105 0 49 38 1
9 25 2 8460 9 12 1 2 0 4 1 79 0 39 38 1

This function takes an additional argument, threshold, which is the correlation above which a feature should be removed; in this case it is 0.9. The cleaned data is saved to a new CSV file named 'clean_data.csv'. Removing correlated features can sometimes lead to a loss of information, so it is important to understand the trade-offs. From our correlation chart and data visualisations we can see that no features in this case are so heavily correlated that they would need to be removed, but it is still a useful check for validating the accuracy of our results. A minimal sketch of the idea is shown below.
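The module's implementation is not included in the notebook; assuming it drops one feature from each highly correlated pair and writes the result to 'clean_data.csv' as described above, an equivalent sketch (function name is illustrative) would be:

# Sketch (assumed behaviour): drop one feature from each pair whose absolute correlation exceeds the threshold
def remove_correlated_inputs_sketch(data, threshold=0.9):
    corr = data.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    cleaned = data.drop(columns=to_drop)
    cleaned.to_csv('clean_data.csv', index=False)
    return cleaned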

Normal Distribution¶

In [25]:
normalisation(clean_data)

# Load the data from a CSV file
clean_data = pd.read_csv("normalized_data.csv")
print(clean_data.head(10))
        age  workclass    fnlwgt  education  education-num  marital-status  \
0  0.030778  -1.962630 -1.294106  -0.335437       1.134739        1.218073   
1  0.837509   2.053408 -1.251950  -0.335437       1.134739       -0.161128   
2 -0.042561   0.045389  0.593020   0.181332      -0.420060       -1.540329   
3  1.057526   0.045389  0.799670  -2.402511      -1.197459       -0.161128   
4 -0.775952   0.045389  1.464090  -0.335437       1.134739       -0.161128   
5 -0.115900   0.045389  1.190486   0.439716       1.523438       -0.161128   
6  0.764169   0.045389 -0.324505  -1.110590      -1.974858       -1.540329   
7  0.984187   2.053408  0.515981   0.181332      -0.420060       -0.161128   
8 -0.555934   0.045389 -1.517784   0.439716       1.523438        1.218073   
9  0.250796   0.045389 -0.337069  -0.335437       1.134739       -0.161128   

   occupation  relationship      race       sex  capital-gain  capital-loss  \
0   -1.718121     -0.277805  0.400252  0.703071      0.793942     -0.204177   
1   -1.207608     -0.900181  0.400252  0.703071     -0.279023     -0.204177   
2   -0.697096     -0.277805  0.400252  0.703071     -0.279023     -0.204177   
3   -0.697096     -0.900181 -2.310923  0.703071     -0.279023     -0.204177   
4    0.323929      2.211698 -2.310923 -1.422331     -0.279023     -0.204177   
5   -1.207608      2.211698  0.400252 -1.422331     -0.279023     -0.204177   
6    0.068672     -0.277805 -2.310923 -1.422331     -0.279023     -0.204177   
7   -1.207608     -0.900181  0.400252  0.703071     -0.279023     -0.204177   
8    0.323929     -0.277805  0.400252 -1.422331      4.227429     -0.204177   
9   -1.207608     -0.900181  0.400252  0.703071      3.111545     -0.204177   

   hours-per-week  native-country    salary  
0       -0.031122        0.263562 -0.563199  
1       -2.254475        0.263562 -0.563199  
2       -0.031122        0.263562 -0.563199  
3       -0.031122        0.263562 -0.563199  
4       -0.031122       -5.281733 -0.563199  
5       -0.031122        0.263562 -0.563199  
6       -2.007436       -2.509086 -0.563199  
7        0.380610        0.263562  1.775573  
8        0.792342        0.263562  1.775573  
9       -0.031122        0.263562  1.775573  

This code reads a CSV into a pandas DataFrame, standardises the numeric columns using a z-score transform (zero mean and unit standard deviation), and saves the result to a new CSV. Standardisation puts all features on a common scale, which can improve machine learning model performance and make outliers easier to detect; it does not change the shape of each feature's distribution. However, we later opted for a less intrusive standardisation method, as our data had no outliers. An equivalent sketch of the step is shown below.
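The normalisation module itself is not reproduced in the notebook; assuming it performs the z-score step described above and writes 'normalized_data.csv', an equivalent sketch (function name is illustrative) would be:

# Sketch (assumed): z-score standardisation of numeric columns, written to normalized_data.csv
def normalisation_sketch(data):
    numeric_cols = data.select_dtypes(include='number').columns
    standardised = data.copy()
    standardised[numeric_cols] = (data[numeric_cols] - data[numeric_cols].mean()) / data[numeric_cols].std()
    standardised.to_csv('normalized_data.csv', index=False)
    return standardised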

Both normalisation and the removal of correlated inputs were left out of the final dataset: we considered these approaches but decided not to apply them to the original dataset used for modelling.

5 - In-depth Analysis¶

SVC Modelling (Comparison Model)¶

This SVC model is the comparison model for logistic regression. It was introduced to test whether another model could improve how successfully the problem statement is addressed.

Why SVC model was used¶

Support Vector Classification models excel at classifying high-dimensional data. They perform well on data that contains a large number of features, but are computationally very expensive. Using GridSearchCV allowed us to test a range of hyperparameters and ensured we would find the most accurate combination, although this took a long time to train, as the fold timings below show. The excessively high accuracy rate could mean that the model was a very good choice for this dataset or that overfitting has occurred. To attempt to avoid overfitting, the data was scaled in a later cell; however, this had no noticeable impact on the model's accuracy.

In [22]:
X = df_updated.drop('salary', axis=1)  # Features
y = df_updated['salary']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(df_updated, y, test_size=0.2, random_state=42)
svm_model = SVC(C=10, kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# Evaluate the model without best Params
y_pred = svm_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()

# gridsearch stuff
param_grid = {
    'C': [0.001, 0.1, 1],
    'kernel': ['linear']
}

grid_search = GridSearchCV(SVC(), param_grid, verbose=2)
grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)

# Use the best estimator to make predictions
y_pred = grid_search.predict(X_test)

# Evaluate the best model found by grid search
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()


# Model is still overfitted.
from sklearn.model_selection import cross_val_score

# Use cross_val_score for cross-validation
svm_model = SVC(kernel='linear', C=1, random_state=42)
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=5)  # Adjust the number of folds as needed

# Print average cross-validation accuracy
print("Average Cross-Validation Accuracy:", np.mean(cv_scores))
Accuracy: 0.9996926859250154
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4905
           1       1.00      1.00      1.00      1603

    accuracy                           1.00      6508
   macro avg       1.00      1.00      1.00      6508
weighted avg       1.00      1.00      1.00      6508

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END .............................C=0.001, kernel=linear; total time= 9.0min
[CV] END .............................C=0.001, kernel=linear; total time=14.1min
[CV] END .............................C=0.001, kernel=linear; total time=63.9min
[CV] END .............................C=0.001, kernel=linear; total time=10.9min
[CV] END .............................C=0.001, kernel=linear; total time= 9.4min
[CV] END ..............................C=0.01, kernel=linear; total time= 2.3min
[CV] END ..............................C=0.01, kernel=linear; total time= 4.2min
[CV] END ..............................C=0.01, kernel=linear; total time= 3.5min
[CV] END ..............................C=0.01, kernel=linear; total time= 2.6min
[CV] END ..............................C=0.01, kernel=linear; total time= 3.8min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.1min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.6min
[CV] END ...............................C=0.1, kernel=linear; total time= 4.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.4min
[CV] END .................................C=1, kernel=linear; total time= 2.5min
[CV] END .................................C=1, kernel=linear; total time= 3.8min
[CV] END .................................C=1, kernel=linear; total time= 5.2min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
Best parameters found:  {'C': 0.001, 'kernel': 'linear'}
Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4905
           1       1.00      1.00      1.00      1603

    accuracy                           1.00      6508
   macro avg       1.00      1.00      1.00      6508
weighted avg       1.00      1.00      1.00      6508

Average Cross-Validation Accuracy: 0.9986169582647377

Results observations¶

The high results show that the model may be overfitted. Examining data:

In [27]:
df_updated.head()
Out[27]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 22 1 2671 9 12 2 0 1 4 1 25 0 39 38 0
1 33 3 2926 9 12 1 2 0 4 1 0 0 12 38 0
2 21 2 14086 11 8 0 4 1 4 1 0 0 39 38 0
3 36 2 15336 1 6 1 4 0 1 1 0 0 39 38 0
4 11 2 19355 9 12 1 8 5 1 0 0 0 39 4 0

This shows that the data could be scaled better, particularly the fnlwgt feature. The high range of values in fnlwgt could be responsible for the odd behaviour of the model. The code below scales the data in an attempt to prevent overfitting:

In [28]:
# The scaling step was added after the model achieved 99.8% accuracy, pointing to possible overfitting. 
# High values in the dataset may be the reason for this. The code below scales the values in the dataset.

from sklearn.preprocessing import StandardScaler

X_train, X_test = scale_the_data(X_train, X_test)


# gridsearch stuff
param_grid = {
    'C': [0.001, 0.1, 1],
    'kernel': ['linear']
}

grid_search = GridSearchCV(SVC(), param_grid, verbose=2)
grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)

# Use the best estimator to make predictions
y_pred = grid_search.predict(X_test)

# Evaluate the best model found by grid search
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()


# Model is still overfitted.
from sklearn.model_selection import cross_val_score

# Use cross_val_score for cross-validation
svm_model = SVC(kernel='linear', C=1, random_state=42)
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=5)  # Adjust the number of folds as needed

# Print average cross-validation accuracy
print("Average Cross-Validation Accuracy:", np.mean(cv_scores))
Accuracy: 0.9996926859250154
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4905
           1       1.00      1.00      1.00      1603

    accuracy                           1.00      6508
   macro avg       1.00      1.00      1.00      6508
weighted avg       1.00      1.00      1.00      6508

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] END .............................C=0.001, kernel=linear; total time= 9.1min
[CV] END .............................C=0.001, kernel=linear; total time=14.2min
[CV] END .............................C=0.001, kernel=linear; total time=12.0min
[CV] END .............................C=0.001, kernel=linear; total time=11.1min
[CV] END .............................C=0.001, kernel=linear; total time= 9.4min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.1min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 2.4min
[CV] END ...............................C=0.1, kernel=linear; total time= 4.5min
[CV] END ...............................C=0.1, kernel=linear; total time= 3.4min
[CV] END .................................C=1, kernel=linear; total time= 2.4min
[CV] END .................................C=1, kernel=linear; total time= 3.9min
[CV] END .................................C=1, kernel=linear; total time= 5.3min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
[CV] END .................................C=1, kernel=linear; total time= 3.1min
Best parameters found:  {'C': 0.001, 'kernel': 'linear'}
Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4905
           1       1.00      1.00      1.00      1603

    accuracy                           1.00      6508
   macro avg       1.00      1.00      1.00      6508
weighted avg       1.00      1.00      1.00      6508

Average Cross-Validation Accuracy: 0.9986169582647377

Logistic Regression (Main Model)¶

Logistic regression was the main model for solving the problem statement, and it is evaluated below.

Why logistic regression was used¶

Logistic regression is a faster machine learning method to train than SVC. Considering the size of our dataset, it made more sense to initially train a logistic regression model due to time constraints. The logistic regression model was selected as it is efficient to train on multi-dimensional data containing a large number of features. The dataset appeared to be linearly separable, making a logistic regression algorithm a good choice. Although the model is simpler than a support vector machine, it has successfully learned from the dataset, as evidenced in the confusion matrix shown below.

First Round¶

In the first round of running the model, the testing accuracy is 81.3%. We will now explore how to improve the model.

In [26]:
X = df_updated.drop('salary', axis=1)  # Features
y = df_updated['salary']  # Target variable

# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train) # Training the model

# Classification Report and Accuracy
y_pred = clf.predict(X_test) # Testing the model using the test data
cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()

print(classification_report(y_test, y_pred)) # Printing precision recall and other evaluation metrics
print(f"Training Accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test, y_test):.3f}")
              precision    recall  f1-score   support

           0       0.83      0.95      0.89      4942
           1       0.71      0.39      0.50      1571

    accuracy                           0.81      6513
   macro avg       0.77      0.67      0.69      6513
weighted avg       0.80      0.81      0.79      6513

Training Accuracy: 0.809
Testing Accuracy: 0.813

Receiver Operating Characteristic (ROC) Curve:¶

The ROC curve visualizes the trade-off between true positive rate (TPR, correctly classified positives) and false positive rate (FPR, incorrectly classified negatives) at different classification thresholds. The AUC score summarizes this performance across all thresholds, with a higher AUC (closer to 1) indicating better discrimination between classes (a random classifier would have an AUC of 0.5). This helps evaluate the logistic regression model's ability to distinguish between the two classes based on TPR (correctly identifying positives).

In [29]:
# Visualization 2: ROC Curve

y_prob = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = {:.2f})'.format(roc_auc_score(y_test, y_prob)))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

Second Round Explanation¶

As we saw in data pre-processing when checking the correlation between input features, the two most correlated features are relationship and sex, with a correlation of 0.58. Even though this is not very high, we can see how the model performs after removing one of these features. Below we drop relationship from the dataset and run the model.

As the results show, removing 'relationship' slightly lowered the model's performance, so this is not an ideal technique.

In [27]:
X = df_updated.drop(['relationship', 'salary'], axis=1)  # Features
y = df_updated['salary']  # Target variable

# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train) # Training the model

# Classification Report and Accuracy
y_pred = clf.predict(X_test)
cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()

print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test, y_test):.3f}")
              precision    recall  f1-score   support

           0       0.83      0.94      0.88      4942
           1       0.68      0.40      0.50      1571

    accuracy                           0.81      6513
   macro avg       0.76      0.67      0.69      6513
weighted avg       0.79      0.81      0.79      6513

Training Accuracy: 0.807
Testing Accuracy: 0.810

Third Round Explanation¶

As the results show, removing 'sex' left the model's performance essentially unchanged from the first round, so dropping this feature provides no benefit.

In [28]:
X = df_updated.drop(['sex', 'salary'], axis=1)  # Features
y = df_updated['salary']  # Target variable

# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train) # Training the model

# Classification Report and Accuracy
y_pred = clf.predict(X_test)
cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()

print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test, y_test):.3f}")
              precision    recall  f1-score   support

           0       0.83      0.95      0.89      4942
           1       0.70      0.39      0.50      1571

    accuracy                           0.81      6513
   macro avg       0.77      0.67      0.69      6513
weighted avg       0.80      0.81      0.79      6513

Training Accuracy: 0.809
Testing Accuracy: 0.813

Parameter Tuning¶

As part of optimising our model's performance after the first run, parameter tuning is used as shown below. This approach adjusts the model's parameters, such as the C value and the maximum number of iterations, to find the values that improve performance and make the model more effective at meeting our problem statement.

To speed up the tuning process, grid search is used. This technique finds the best parameters for us from a set of candidate values without having to manually trial different numbers. As the model accuracy below shows, hyperparameter tuning slightly improved the model's performance.

In [30]:
# Features and Target variable
X = df_updated.drop('salary', axis=1)
y = df_updated['salary']

# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression with GridSearchCV for parameter tuning
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 500, 1000],
    'max_iter': [1000, 1200, 1300, 1400, 1500]
}

clf = LogisticRegression(max_iter=10000, random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Train the model with the best parameters
best_clf = LogisticRegression(
    C=best_params['C'],
    max_iter=best_params['max_iter'],
    random_state=42
)
best_clf.fit(X_train, y_train)

# Classification Report and Accuracy
y_pred = best_clf.predict(X_test)
cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()

print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {best_clf.score(X_train, y_train):.3f}")
print(f"Testing Accuracy: {best_clf.score(X_test, y_test):.3f}")
Best Parameters: {'C': 100, 'max_iter': 1000}
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.95      0.89      4942
           1       0.70      0.40      0.51      1571

    accuracy                           0.81      6513
   macro avg       0.77      0.67      0.70      6513
weighted avg       0.80      0.81      0.79      6513

Training Accuracy: 0.810
Testing Accuracy: 0.814

Fifth Round - (Standard Scaler on Logistic Regression)¶

By applying a standard scaler before logistic regression, the testing accuracy increased from 0.813 to 0.824. This happens because logistic regression relies heavily on the linear relationships between features and the target variable; when features have very different scales, it can be harder for the model to learn these relationships effectively. Standardisation addresses this by transforming each feature to have a mean of 0 and a standard deviation of 1. This puts all features on an equal footing, allowing the model to focus on the underlying patterns in the data and leading to more accurate predictions.
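The scale_the_data helper is not shown in the notebook; assuming it fits a StandardScaler on the training split only and then applies the same transformation to both splits, a minimal sketch would be:

# Sketch of the assumed behaviour of the scale_the_data helper
from sklearn.preprocessing import StandardScaler

def scale_the_data_sketch(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
    X_test_scaled = scaler.transform(X_test)        # apply the same transformation to the test data
    return X_train_scaled, X_test_scaled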

In [31]:
X = df_updated.drop('salary', axis=1)  # Features
y = df_updated['salary']  # Target variable

# Correct train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
X_train_scaled, X_test_scaled = scale_the_data(X_train, X_test)

# Logistic Regression
clf = LogisticRegression(C=1000, max_iter=1300)
clf.fit(X_train_scaled, y_train) # Training the model

# Classification Report and Accuracy
y_pred = clf.predict(X_test_scaled)
cm_new = confusion_matrix(y_test, y_pred)

# Confusion Matrix Display
plt.figure(figsize=(8, 6))
sns.heatmap(cm_new, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.title('Confusion Matrix')
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()

print(classification_report(y_test, y_pred))
print(f"Training Accuracy: {clf.score(X_train_scaled, y_train):.3f}")
print(f"Testing Accuracy: {clf.score(X_test_scaled, y_test):.3f}")
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      4942
           1       0.71      0.46      0.56      1571

    accuracy                           0.82      6513
   macro avg       0.78      0.70      0.72      6513
weighted avg       0.81      0.82      0.81      6513

Training Accuracy: 0.823
Testing Accuracy: 0.824

Cross Validation¶

As we saw in parameter tuning, the testing accuracy went up to 0.814, showing an improvement. Now we will conduct cross validation on the model, using C=1000 and max_iter=1300.

Cross validation was then used, as an alternative to train_test_split, to see whether the model could be improved. For this approach the k-fold technique was used, which splits and trains the data over multiple folds. The dataset is divided into k equal-sized folds; in each run, one fold is used for testing and the remaining k-1 folds for training. This process runs k times, with each of the k folds used exactly once as the testing data. As shown below, this approach was tested using different numbers of folds, and the experiment showed that the k-fold technique did not improve the model's performance.

In [32]:
# Features and Target variable
X = df_updated.drop('salary', axis=1)
y = df_updated['salary']

# Logistic Regression with specified parameters
clf = LogisticRegression(C= 1000, max_iter= 1300, random_state=42)

def crossValFunction(numOfFolds):
    # Define the k-fold split with the requested number of folds
    folds_define = KFold(random_state=42, shuffle=True, n_splits=numOfFolds)

    # Lists to store results
    y_test_results = []
    y_pred_results = []
    score_list = []  # List to store accuracy scores for each fold
    confusion_matrices = []

    # Iterate through the folds
    for train_index, test_index in folds_define.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        clf.fit(X_train, y_train)  # Train each fold

        y_fold_pred = clf.predict(X_test)  # Make predictions on each fold data

        fold_accuracy = accuracy_score(y_test, y_fold_pred)
        score_list.append(fold_accuracy)
        fold_confusion_matrix = confusion_matrix(y_test, y_fold_pred)

        # Append results to lists
        y_test_results.extend(y_test)
        y_pred_results.extend(y_fold_pred)

        print(f"\nFold Accuracy: {fold_accuracy:.4f}")
        print(f"Fold Confusion Matrix:\n{fold_confusion_matrix}")

    # Evaluate the overall model performance using all the data
    overall_accuracy = accuracy_score(y_test_results, y_pred_results)
    overall_confusion_matrix = confusion_matrix(y_test_results, y_pred_results)

    print(f"\nOverall Accuracy for all folds: {overall_accuracy:.4f}")
    print("\nOverall Confusion Matrix for all folds is:\n", overall_confusion_matrix)

    # Plot the boxplot of accuracy scores
    plt.boxplot(score_list)
    plt.title('Accuracy Distribution Across Folds')
    plt.ylabel('Accuracy')
    plt.show()

    # Plot the overall confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(overall_confusion_matrix, annot=True, fmt='d', cmap='Reds', xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
    plt.title('Confusion Matrix')
    plt.xlabel('y_pred')
    plt.ylabel('y_test')
    plt.show()
In [33]:
crossValFunction(7)
Fold Accuracy: 0.8100
Fold Confusion Matrix:
[[3334  181]
 [ 703  434]]

Fold Accuracy: 0.8177
Fold Confusion Matrix:
[[3394  175]
 [ 673  410]]

Fold Accuracy: 0.8070
Fold Confusion Matrix:
[[3320  208]
 [ 690  434]]

Fold Accuracy: 0.8119
Fold Confusion Matrix:
[[3343  195]
 [ 680  434]]

Fold Accuracy: 0.8063
Fold Confusion Matrix:
[[3330  192]
 [ 709  420]]

Fold Accuracy: 0.8078
Fold Confusion Matrix:
[[3333  194]
 [ 700  424]]

Fold Accuracy: 0.8138
Fold Confusion Matrix:
[[3312  209]
 [ 657  473]]

Overall Accuracy for all folds: 0.8106

Overall Confusion Matrix for all folds is:
 [[23366  1354]
 [ 4812  3029]]
In [34]:
crossValFunction(10)
Fold Accuracy: 0.8096
Fold Confusion Matrix:
[[2316  140]
 [ 480  321]]

Fold Accuracy: 0.8160
Fold Confusion Matrix:
[[2351  135]
 [ 464  306]]

Fold Accuracy: 0.8179
Fold Confusion Matrix:
[[2378  135]
 [ 458  285]]

Fold Accuracy: 0.8056
Fold Confusion Matrix:
[[2305  135]
 [ 498  318]]

Fold Accuracy: 0.8050
Fold Confusion Matrix:
[[2324  137]
 [ 498  297]]

Fold Accuracy: 0.8034
Fold Confusion Matrix:
[[2348  135]
 [ 505  268]]

Fold Accuracy: 0.8090
Fold Confusion Matrix:
[[2353  137]
 [ 485  281]]

Fold Accuracy: 0.8102
Fold Confusion Matrix:
[[2315  132]
 [ 486  323]]

Fold Accuracy: 0.8084
Fold Confusion Matrix:
[[2317  161]
 [ 463  315]]

Fold Accuracy: 0.8188
Fold Confusion Matrix:
[[2335  131]
 [ 459  331]]

Overall Accuracy for all folds: 0.8104

Overall Confusion Matrix for all folds is:
 [[23342  1378]
 [ 4796  3045]]
In [35]:
crossValFunction(20)
Fold Accuracy: 0.8109
Fold Confusion Matrix:
[[1159   71]
 [ 237  162]]

Fold Accuracy: 0.8133
Fold Confusion Matrix:
[[1162   64]
 [ 240  162]]

Fold Accuracy: 0.8077
Fold Confusion Matrix:
[[1165   62]
 [ 251  150]]

Fold Accuracy: 0.8206
Fold Confusion Matrix:
[[1183   76]
 [ 216  153]]

Fold Accuracy: 0.8268
Fold Confusion Matrix:
[[1204   68]
 [ 214  142]]

Fold Accuracy: 0.8084
Fold Confusion Matrix:
[[1174   67]
 [ 245  142]]

Fold Accuracy: 0.8139
Fold Confusion Matrix:
[[1180   51]
 [ 252  145]]

Fold Accuracy: 0.8034
Fold Confusion Matrix:
[[1133   76]
 [ 244  175]]

Fold Accuracy: 0.8041
Fold Confusion Matrix:
[[1167   73]
 [ 246  142]]

Fold Accuracy: 0.8059
Fold Confusion Matrix:
[[1155   66]
 [ 250  157]]

Fold Accuracy: 0.8243
Fold Confusion Matrix:
[[1184   72]
 [ 214  158]]

Fold Accuracy: 0.7955
Fold Confusion Matrix:
[[1156   71]
 [ 262  139]]

Fold Accuracy: 0.7961
Fold Confusion Matrix:
[[1161   91]
 [ 241  135]]

Fold Accuracy: 0.8016
Fold Confusion Matrix:
[[1161   77]
 [ 246  144]]

Fold Accuracy: 0.8145
Fold Confusion Matrix:
[[1167   60]
 [ 242  159]]

Fold Accuracy: 0.8065
Fold Confusion Matrix:
[[1146   74]
 [ 241  167]]

Fold Accuracy: 0.8163
Fold Confusion Matrix:
[[1178   74]
 [ 225  151]]

Fold Accuracy: 0.7973
Fold Confusion Matrix:
[[1143   83]
 [ 247  155]]

Fold Accuracy: 0.8077
Fold Confusion Matrix:
[[1155   72]
 [ 241  160]]

Fold Accuracy: 0.8329
Fold Confusion Matrix:
[[1180   59]
 [ 213  176]]

Overall Accuracy for all folds: 0.8104

Overall Confusion Matrix for all folds is:
 [[23313  1407]
 [ 4767  3074]]

Modelling Comparison¶

Although the SVC model achieved a far higher accuracy than the Logistic Regression model, the Logistic Regression model was much less likely to be overfitted when training on the data, and it performed well in this use case.

Conclusion¶

We have shown that it is possible to accurately predict whether a given person makes over 50,000 per year based on several factors. A Logistic Regression machine learning model was an acceptable choice for this, and performed with 82.4% accuracy on a cleaned and prepared dataset. Future work could include experimenting with models specialised for linearly separable datasets, such as a linear SVM (a support vector machine optimised for the linear kernel), which would reduce training time and allow us to experiment with more C values.

Group Reflection¶

As a group we held regular weekly meetings at a suitable time to establish a clear communication plan, brainstorm solutions and delegate tasks. Our team worked extremely well together. Following each call, each member received a specific task to complete before the next meeting; this focused approach kept everyone accountable and ensured steady progress. Finally, we merged everyone's contributions into a final document, reflecting the equal effort and expertise each member brought to this project. This collaborative approach resulted in completing the coursework before the deadline and worked well within the group, with everyone satisfied with the overall result.

Where individuals were struggling or progress was slow, the team worked effectively to patch things up and overcome the issue collectively. This showed that we worked effectively as a team and took ownership of the project. We sometimes split into smaller groups to work on more focused areas, which was very effective at ensuring fast progress and meant more enhancements could be made to our model. Overall the team was very happy with the teamwork and enthusiasm shown by all members.

We distributed roles within the group while discussing the nature of the tasks we wanted to undertake and the dataset we were using. This involved going through each segment of the specification, such as preprocessing and building the model, and assigning them in an order that ensured everyone had something left to contribute as we went along, since some tasks needed to be completed before others could start. Overall we successfully distributed the work so that we would all have something to contribute while completing this project.

RR 100%, RG 100%, HS 100%, KS 100%, AS 100%

RR Reflection¶

I took a lead role in driving the project forward, handling a substantial portion of the work. This included data preprocessing, exploratory data analysis (EDA), and building the logistic regression model. Additionally, I ensured clear communication and collaboration by structuring a working notebook and merging everyone's contributions into a cohesive whole.

AS Reflection¶

I worked on Gaussian distribution and removing correlated inputs; the challenges I faced included researching different standardisation methods to find which would fit best. By the end I was able to produce a working module, although I made the decision not to use my own standardisation method as it was too intrusive for our data types.

KS Reflection¶

I used a new external package (Plotly) for part of the EDA, which explored the distribution of data between males and females and their salary distributions. The challenge was using a new library not taught in lectures, so to learn Plotly I relied on the documentation and other online resources.

RG Reflection¶

The SVM model with linear kernel was used on the dataset as it excels at making predictions on linearly separable data. Issues with this include linear kernels being more susceptible to outliers in the dataset and the likelihood of overfitting. Learning about the GridSearchCV module saved a lot of time by automating the sequential trial of different C values.

HS Reflection¶

Overall, this was a great learning experience for me. As part of my work, I created a module file to simplify the checking of duplicate values for EDA. My main responsibilities included helping to come up with pre-processing techniques and developing the cross validation and parameter tuning for the logistic regression model. One challenge was finding the most optimal parameters for our model; through research, a technique known as grid search was found to simplify the process. I believe this coursework has allowed me to understand the data science process and enhance my programming knowledge and skills, which I intend to build on in later studies.